A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures

Authors

Qi Liu

Yang Liu

Maosong Sun

Abstract

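Qi Liu, Yang Liu, Maosong Sun (State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084)

Abstract: Parallel corpora are a fundamental data resource that supports applications such as machine translation and cross-language information retrieval. Although the number of parallel web pages on the Internet is huge and keeps growing, the heterogeneity and complexity of parallel websites make it a major challenge to acquire high-quality parallel pages quickly and automatically and to build parallel corpora from them. This paper proposes a method for acquiring parallel web pages that combines URL patterns with HTML structures: it first uses HTML structures to visit parallel pages recursively, and then uses URL patterns to optimize the topological order in which a parallel website is traversed, which makes acquisition both efficient and accurate. Experiments on two parallel websites, those of the United Nations and the Hong Kong government, show that, compared with traditional acquisition methods, our method reduces acquisition time by more than 50%, improves accuracy by 15%, and significantly improves machine translation quality (BLEU increases by 1.6 and 0.7 points, respectively).

Keywords: parallel web page acquisition; parallel corpus; URL patterns; HTML structures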
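To make the notion of a URL pattern concrete, the following is a minimal sketch of one common use of such patterns: grouping URLs that differ only in a language marker. It is not the paper's algorithm, which uses URL patterns to order site traversal and HTML structures to follow links between parallel pages. The marker set (`en`, `zh`), the function names, and the `example.org` URLs are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed language markers: a standalone "en" or "zh" token in the URL.
# Real sites use many other conventions; this pattern is illustrative only.
LANG_PATTERN = re.compile(r"(?<![A-Za-z0-9])(en|zh)(?![A-Za-z0-9])")


def split_language(url: str) -> Optional[Tuple[str, str]]:
    """Return (template, language) for the first marker found, else None."""
    match = LANG_PATTERN.search(url)
    if match is None:
        return None
    template = url[:match.start()] + "{lang}" + url[match.end():]
    return template, match.group(1)


def pair_by_url_pattern(urls: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair URLs whose templates match and whose languages are en and zh."""
    by_template: Dict[str, Dict[str, str]] = {}
    for url in urls:
        parsed = split_language(url)
        if parsed is None:
            continue  # no language marker, so no candidate pair
        template, lang = parsed
        by_template.setdefault(template, {})[lang] = url
    return sorted(
        (langs["en"], langs["zh"])
        for langs in by_template.values()
        if "en" in langs and "zh" in langs
    )


if __name__ == "__main__":
    candidates = [
        "http://example.org/en/news/1.html",
        "http://example.org/zh/news/1.html",
        "http://example.org/en/about.html",   # no zh counterpart
        "http://example.org/contact.html",    # no language marker
    ]
    for en_url, zh_url in pair_by_url_pattern(candidates):
        print(en_url, "<->", zh_url)
```

Pairs found this way are only candidates. A URL match alone does not confirm that two pages are translations of each other, which is one reason to combine it with structural evidence from the HTML.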

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task, and an unfocused approach often produces undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

XML: URL Data Set Creation for Future Web Mining Research Avenues

The rapid expansion of the internet has made the web a popular place for disseminating and collecting information, and it has also opened up many research topics in various fields. Over the last few years, several attempts have been made at web-based research, particularly on HTML web pages because of their wider availability. As a result, many research data sets have been created, and few of them are made avai...

Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems

One of the best-known recommendation methods is user-based Collaborative Filtering (CF). This system compares the active user's item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...

Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are returned by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The pair sentences includ...

GWUM : une généralisation des pages Web guidée par les usages

The usage analysis of a Web site based on the extracted sequential patterns is often limited by the low support of these patterns. That is mainly due to the great diversity of the pages and behaviors. However, it is possible to group the majority of these pages into various categories during preprocessing. Then, using these categories, rather than the URL, will allow us to discover "generic" be...

Publication date: 2013
